DOCUMENT RESUME

ED 325 484                                        TM 015 668

AUTHOR        Frederiksen, John R.; Collins, Allan
TITLE         A Systems Approach to Educational Testing. Technical
              Report No. 2.
INSTITUTION   Center for Technology in Education, New York, NY.
SPONS AGENCY  Office of Educational Research and Improvement (ED),
              Washington, DC.
PUB DATE      Jan 90
CONTRACT      OERI-l-135562167-Al
NOTE          12p.
PUB TYPE      Viewpoints (120) -- Reports - Evaluative/Feasibility
              (142)
EDRS PRICE    MF01 Plus Postage. PC Not Available from EDRS.
DESCRIPTORS   Cognitive Development; Cognitive Tests; *Educational
              Assessment; Educational Change; Elementary Secondary
              Education; Outcomes of Education; Skill Development;
              *Student Evaluation; *Systems Approach; Test
              Construction; Testing Problems; *Test Validity

ABSTRACT 

The validity of educational tests used as critical
measures of educational outcomes within a dynamic system is
discussed. Validity becomes a problem if an educational system adapts
itself to the characteristics of the outcome measures. The concept of
systemically valid tests is introduced; these tests induce
curricular and instructional changes in education systems and 
learning strategy changes in students that foster the development of 
the cognitive traits the tests are designed to measure. Two 
characteristics are analyzed that contribute to or detract from a 
testing system's systemic validity: (1) use of direct rather than
indirect cognitive assessment; and (2) the degree of subjectivity or 
judgment required to assign a score to represent the cognitive skill. 
These characteristics are then applied in developing design 
principles for creating systemically valid testing systems. These
principles are illustrated in the design of a student assessment 
system that includes the means of teaching the process of assessment 
to system users. A list of 29 references is attached. (SLD) 
Abstract

Our concern in this paper is with the validity of educational tests when they are
employed as critical measures of educational outcomes within a dynamic system.
The problem of validity arises if an educational system adapts itself to the
characteristics of the outcome measures. We introduce the concept of systemically
valid tests as ones that induce curricular and instructional changes in education
systems (and learning strategy changes in students) that foster the development of
the cognitive traits that the tests are designed to measure. We analyze some general
characteristics that contribute to or detract from a testing system's systemic
validity, such as the use of direct rather than indirect assessment. We then apply
these characteristics in developing a set of design principles for creating testing
systems that are systemically valid. Finally, we provide an illustration of the
proposed principles by applying them to the design of a student assessment system.
This design example addresses not only specifications for the tests, but also the
means of teaching the process of assessment to users of the system.



There are enormous stakes placed on students'
performance on educational tests. And
there are, consequently, enormous pressures
on school districts, school administrators,
teachers, and students to improve scores on
tests. These pressures drive the educational system to
modify its behavior in ways that will increase test
scores (Darling-Hammond & Wise, 1985; Madaus,
1988). The test scores, rather than playing the role of
passive indicator variables for the state of the system,
become the currency of feedback within an adapting
educational system. The system adjusts its curricular
and instructional practices, and students adjust their
learning strategies and goals, to maximize the scores
on the tests used to evaluate educational outcomes,
and this is particularly true when the stakes are high
(Corbett & Wilson, 1988). Thus, for example, if a
reading test emphasizes certain skills, such as knowledge
of phonics, then these become the skills that will
receive emphasis in the reading curriculum.

Our concern in this paper is with the validity of
educational tests within such a dynamic system. To
introduce tests into a system that adapts itself to the
characteristics of tests poses a particular challenge to
their validity and calls into question many of the
current practices in educational testing. That challenge
to validity has to do with the effects of the
instructional changes engendered by the use of the
test and whether or not they contribute to the development
of the knowledge and/or skills that the test
purportedly measures. This extension of the notion of
construct validity of a test to take into account the
effects of instructional changes brought about by the
introduction of the test into an educational system we
shall refer to as the systemic validity of a test. A
systemically valid test is one that induces in the education
system curricular and instructional changes
that foster the development of the cognitive skills that
the test is designed to measure. Evidence for systemic
validity would be an improvement in those skills after
the test has been in place within the educational
system for a period of time.

Given this challenge to test validity due to systemic
effects, the question we must take up has to do
with whether there are any general characteristics of
a system of testing that can be identified as either
contributing to or detracting from a test's systemic
validity. In our analysis, we shall identify a number of
characteristics that contribute to systemic validity.
We shall then apply these principles in developing a
set of design principles for an alternative form of
testing system that is systemically valid, one that we
believe will drive the educational system toward
practices that will lead to improvements in the underlying
knowledge and skills that tests are seeking to
measure. Finally, we shall provide an illustration of
the proposed principles, in the context of a student
assessment system. (Elsewhere, we have applied the
design principles to teacher assessment, Collins & J. R.
Frederiksen, 1989).

Educational Systems as Dynamic Systems

The measures that educators choose to use in assessing
outcomes provide one important form of feedback
that determines how the system will modify its future
operation. Schoenfeld's (in press) observations of the
teaching of one of the most successful math teachers in
New York State illustrate our point precisely. Students
of geometry in the state of New York must all
pass a statewide Regents' Exam that has become, in no
uncertain terms, the goal of instruction. Scores on the
test are used to judge students, teachers, and school
districts. In geometry, the exam includes as a major
component a required proof (chosen from a list of a
dozen theorems) and also a construction problem (in
which tools such as a straightedge and a compass are
used to "construct" a figure with specified properties).
In the scoring of the proofs, students are expected
to reproduce all the steps of the proof in a two-column
form, listing each proof step and a justification
for that step. In the construction problem, they are
not required to give justifications for the steps of the
construction, but are graded on whether the construction
has all of the required arcs and lines and how
accurately they are drawn. Schoenfeld found that
these characteristics of the Regents' Exam have completely
subverted the way the teacher taught geometry.
Instead of teaching students how to generate
proofs, the teacher had students memorize the steps
for each of the 12 proofs that might be on the exam. In
their constructions, the students were taught how to
carry them out neatly. The students were thus able to
pass the geometry part of the Regents' Exam with
flying colors, but they did not learn how to reason
mathematically.

This example illustrates how the systemic validity
of a test is dependent on the specification of the
construct the test is taken to measure, which is in turn
related to the goals of teaching and learning. If the goal
of teaching geometry is to be able to reproduce formal
proofs and to develop flawless constructions, then the
Regents' geometry test can be said to be systemically
valid. However, if the goal is to assess how students
can develop proofs and use constructions as tools for
mathematical exploration, then the test cannot be said
to be systemically valid, because its use has engendered
instructional adaptations that do not contribute
to the development of these cognitive skills. A test's
validity cannot be evaluated apart from the intended
use of the test (Messick, 1988).

In the absence of feedback and adaptation to the
test, the Regents' test and tests like it may provide an
adequate indication of students' knowledge, because
most representative geometry items will correlate
highly with one another and the use of one or another
particular set of test items will not result, therefore, in
any gross misclassification of test takers. However,
the requirement of systemic validity creates a much
more stringent standard for the construction of tests,
for it requires us to consider evolutions in the form
and content of instruction and students' learning
engendered by use of the test. That is, will instruction
that focuses on the skills and problem formats represented
in tests promote the ability of students to
engage, in the present case, in authentic mathematical
investigations and problem solving? There are several
reasons why we believe that it will not.

1. If a test emphasizes isolated skill components
and items of knowledge, instruction that seeks to
increase test scores is likely to emphasize those skill
components rather than higher level processes (N.
Frederiksen, 1984; Resnick & Resnick, in press).

2. Instruction that seeks to develop specialized
test-taking strategies (e.g., in taking a multiple choice
test, trying to eliminate one or more of the response
alternatives and then guessing) will not improve
domain knowledge and skills.

3. Time and effort spent in directly improving test 
scores in these ways will displace other learning activities
that could more directly address the skills and
learning goals the test was supposed to be measuring 
in the first place. 

4. Students will direct their study strategies to- 
ward those skills (such as memorization) that are 
represented on the tests (and that appear to be valued
by educational institutions) rather than toward
the use of cognitive skills and knowledge in solving 
extended problems. 

One solution to the problem of low systemic
validity would be, of course, to disallow the development
of any instruction aimed explicitly at improving
scores on the test. Such an approach, however, would
deny to the educational system the ability to capitalize
on one of its greatest strengths: to invent, modify,
assimilate, and in other ways improve instruction as a
result of experience. No school should be enjoined
from modifying its practices in response to their perceived
success or failure. Nor should students be
prevented from optimizing their study so as to carry
out the kinds of problem solving valued within their
course of study. Yet if these strategic modifications in
teaching and learning are to be based on test scores,
then their efficacy will depend crucially on the systemic
validity of the tests that are used. We are left,
therefore, with the alternative solution to the problem:
to encourage the inventiveness and adaptability
of educational systems by developing tests that directly
reflect and support the development of the aptitudes
and traits they are supposed to measure.

Characteristics of Systemically Valid Tests

There are two dimensions or characteristics of tests
that have a bearing on their usefulness as facilitators
of educational improvement. These are (a) the directness
of cognitive assessment, and (b) the degree of
subjectivity or judgment required in assigning a score
to represent the cognitive skill.

In indirect tests, an abstract cognitive skill is measured
by evaluating less abstract, more directly observable
features of performance that are known (or theoretically
expected) to be highly correlated with the
abstract skill. For example, verbal aptitude, a construct
that might be defined as "the ability to formulate
and express arguments in verbal form," is measured
using tests of vocabulary knowledge or verbal
analogies. In direct tests, the cognitive skill that is of
interest is directly evaluated as it is expressed in the
performance of some extended task. An example
would be to rate the coherence of an argument in a
legal brief.

The degree of subjectivity of a test refers to the
degree to which judgment is used in assigning a score
to a student's test performance. Objective tests use
simple, algorithmic scoring methods such as counting
the number of items correct. Subjective tests, on the
other hand, require judgment, analysis, and reflection
on the part of the scorer in the assignment of a score.
Because the scoring algorithms of objective tests are
simple, the item formats of such tests are usually constructed
to invoke unitary responses, such as selecting
one from a set of multiple-choice response alternatives
or writing a single word, phrase, or number.
Subjective tests do not necessitate this restriction on
the form of response and typically allow more extended
responses to a test item, such as the writing of
an essay. Drew Gitomer (personal communication,
May 8, 1989) has pointed out that in objective tests,
there is a low degree of inference required at the item-scoring
level, but a much higher degree of inference
required when items are aggregated using a psychometric
model (e.g., item response theory, factor
analysis) to produce a scale representing a particular
construct. Subjective tests require, in contrast, more
judgment and expertise in scoring at the item level,
but very little inference at the level of summarizing
item level scores. In educational testing, objective
tests are generally preferred because they reduce the
scoring task to a simple, objective scoring algorithm
such as a tallying of correct answers. Benefits of such
objective tests are the reliability of scoring, the lack of
potential biases that might affect score assignments,
and the ease and economy of algorithmic scoring.

Problems with using objective tests. We believe that
one pays a very high price in reduced systemic validity
for using objective tests. This is due to the fact that
the desire for objective tests leads to tests that are
indirect, and indirect tests often have problems of
systemic validity. For example, in teacher assessment,
competency can be assessed using tests of teachers'
knowledge (domain knowledge and pedagogical
knowledge) and basic skills (e.g., reading and mathematics).
However, while such knowledge may be associated
with or even necessary for effective practice
as a teacher, it does not provide direct evidence of
such practice, nor will developing such knowledge
ensure more effective teaching. Similar remarks can
be made about tests of factual knowledge as a measure
of accomplishment at the end of a course in history
or tests of vocabulary knowledge as a measure of the
capacity to do college work. In general, objective tests
emphasize low-level skills, factual knowledge, memorization
of procedures, and isolated skills, and these
are aspects of performance that correlate with but do
not constitute the flexible, high-level skills needed for
generating arguments and constructing solutions to
problems (N. Frederiksen, 1989; Resnick & Resnick, in
press). Use of objective tests thus leads to teaching
strategies that emphasize the conveying of information
and to student learning strategies that emphasize
memorization of facts and procedures, rather than
learning to generate solutions to problems, including
novel problems that occur in "real life" contexts.
N. Frederiksen (1984) has termed this effect of tests on
the content of instruction "the real test bias."

In some cases, it may be possible to construct 
objective tests that are direct measures of important 
cognitive constructs, such as identifying mental models 
in physics (Clement, 1982; McCloskey, Caramazza, &
Green, 1980; McDermott, 1984; White, 1983) or assessing
creativity in scientific problem solving (N.
Frederiksen, 1978). It may also be possible to use tech- 
niques of artificial intelligence to build relatively 
detailed models of students' knowledge on the basis 
of extended examples of their problem solving 
(Anderson, Boyle, & Reiser, 1985; Clancey, 1983; J. R. 
Frederiksen & White, 1989; Johnson & Soloway, 1985;
Sleeman & Brown, 1982). Although it is worthwhile to 
continue efforts to develop objective tests of impor- 
tant cognitive outcomes of learning, in general the 
state of the art does not permit objective tests for 
directly measuring higher order thinking skills, prob- 
lem-solving strategies, and metacognitive abilities 
involved in tasks such as teaching, writing, construct- 
ing a historical argument, and "doing" mathematics. 
Thus we believe that it is important to consider some 
of the advantages of subjective, direct assessment of 
such high-order cognitive skills. 

Advantages of direct tests. Direct tests attempt to
evaluate a cognitive skill as it is expressed in the
performance of extended tasks. Such measures are
systemically valid, because instruction that improves
the test score will also have improved performance on
the extended task and the expression of the cognitive
skill within the task context. In figure skating and
gymnastics, for example, measures of traits such as
technical merit and artistic impression are assigned
by judges based on an extended program that is developed
and performed by the athlete.

In educational testing, a particularly good example
of this approach (and one that has been seminal
in influencing our thinking) is the primary trait system
for scoring writing tasks that was developed by
the National Assessment of Educational Progress
(NAEP) (Mullis, 1980). The purpose of the NAEP
assessment was to measure whether a piece of writing
is successful or unsuccessful in achieving a particular
purpose. The student is given a writing assignment
with a particular goal, such as writing a letter to the
chairman of the school board on the advisability of
instituting a 12-month school year. To evaluate such
writing, a set of primary traits was developed that are
important for successfully achieving the goal of the
writing assignment. For example, one primary trait,
persuasiveness, involves the presentation of a set of
logical and compelling arguments. The completed
writing exercise is rated on a set of such primary traits,
using a simple 4-point scale for each. For example,
persuasiveness is rated as follows: "1" for a paper
containing no reasonable argument, "2" for a paper
having one or two poorly thought out arguments, "3"
for a paper containing several logically thought out
reasons, and "4" for a paper containing in addition a
number of compelling details (Mullis).

Basing educational assessment on such subjective
scoring requires that scorers understand the scoring
categories and be taught how to use them reliably.
This in turn necessitates building a library of exemplars
of student work representing different levels of
the desired primary traits. This library is then used to
train scorers to assess the traits. In the case of the
NAEP writing assessment, for each writing exercise,
exemplars of texts scored in each category are provided.
In addition, a detailed rationale is included for
each exemplar explaining why the particular score
has been assigned. Assessors study these exemplars
and practice scoring until they have internalized the
criteria and can rate primary trait performance reliably
in a variety of task contexts. In the NAEP primary
trait assessment of writing, a typical interscorer agreement
of 91%-95% was achieved. Moreover, studies
have shown that individual, remote scorers, following
calibration (Braun, 1986), can provide scores that
approach quite closely the values derived using standardized
scoring methods (Breland & Jones, 1988).
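
The interscorer agreement figures quoted above can be made concrete
with a small calculation. The Python sketch below is illustrative
only. It assumes that agreement means the percentage of papers on
which two scorers assign the identical rating on the 4-point scale,
which the text does not specify. The function name percent_agreement
is hypothetical, and the ratings are invented, not NAEP data.

```python
from typing import Sequence

# Valid ratings on the 4-point primary trait scale described in the text.
SCALE = (1, 2, 3, 4)


def percent_agreement(scorer_a: Sequence[int], scorer_b: Sequence[int]) -> float:
    """Return the percentage of papers given identical ratings by two scorers.

    Assumption: "agreement" means exact agreement; the source does not say
    how the NAEP figure was computed.
    """
    if len(scorer_a) != len(scorer_b):
        raise ValueError("both scorers must rate the same set of papers")
    if not scorer_a:
        raise ValueError("at least one rated paper is required")
    for rating in (*scorer_a, *scorer_b):
        if rating not in SCALE:
            raise ValueError(f"rating {rating!r} is not on the 4-point scale")
    # Count the papers on which the two ratings match exactly.
    matches = sum(a == b for a, b in zip(scorer_a, scorer_b))
    return 100.0 * matches / len(scorer_a)


# Invented persuasiveness ratings for ten papers (not real NAEP data).
ratings_a = [1, 2, 2, 3, 4, 3, 2, 1, 4, 3]
ratings_b = [1, 2, 3, 3, 4, 3, 2, 1, 4, 3]
agreement = percent_agreement(ratings_a, ratings_b)
```

A training program could run a check of this kind after each
practice round to see whether a trainee's ratings have converged on
those of an experienced scorer.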

It would be difficult to justify the cost of developing
these training materials if they were to be used
only to train professional assessors. However, there is
another use to which they can be put: The training
materials can become the medium for communicating to
teachers and students the critical traits to look for in good
writing, good historical analysis, and good problem solving.
The library of exemplars can be viewed as a set of
"case studies" that can be used by teachers to make
their students aware of the nature of expert performance,
or as Wolf puts it, to help them "develop a keen
sense of standards and critical judgment" (1987, p. 26).
Using them, students can learn to assess their own
work in the same way that their teachers will judge it.
They can, for example, learn to recognize critical traits
in their writing and to carry this awareness along with
them as they carry out their assignments. The assessment
system provides a basis for developing a metacognitive
awareness of what are important characteristics
of good problem solving, good writing, good
experimentation, good historical analysis, and so on.
Moreover, such an assessment can address not only
the product one is trying to achieve, but also the
process of achieving it, that is, the habits of mind that
contribute to successful writing, painting, and problem
solving (Wiggins, 1989). We believe that building
such awareness will lead to genuine improvements in
the cognitive traits on which the assessment system is
based. We argue, therefore, that adopting subjective,
direct assessment is a good way to increase the systemic
validity of a testing system.

Principles for the Design of 
Systemically Valid Testing 

Our plan for the design of a systemically valid testing
system has three major aspects: (a) the components of
the testing system; (b) the standards to be sought in
the design of the system; and (c) the methods by which 
the system encourages learning. A general outline of 
the design specification will be presented in this sec- 
tion. In the subsequent section, we will illustrate the 
applications of this design for a student assessment 
system. 

Components of the Testing System 

The testing system we envision has four major components:
a set of tasks, a specification of primary traits to
be assessed, a library of exemplars of performances on
each task, and a training system for teaching how to
score the primary traits.



Set of tasks. The tests should consist of a representative
set of tasks that cover the spectrum of knowledge,
skills, and strategies needed for the activity or
domain being tested. For example, in student assessment,
if there is a set of basic problem-solving skills we
think students should acquire, these skills must be
called for in the tasks given. The tasks might be constructed
as in the assessment of figure skating, a set of
compulsory tasks plus a set of elective tasks, so that
testees can demonstrate both their basic abilities in
compulsory tasks and their planning and creativity in
elective tasks. The tasks should be authentic, ecologically
valid tasks in that they are representative of the
ways in which knowledge and skills are used in "real
world" contexts (Brown, Collins, & Duguid, 1989;
Wiggins, 1989).

Primary traits for each task and subprocess. The
knowledge and skills used in performing any task
may consist of distinct subprocesses. For example,
teaching might be broken down into planning, classroom
practice, and evaluating students' work, each of
which requires somewhat different talents. These subprocesses
need to be assessed independently so that
test takers will direct their efforts to doing well in all
phases of the task domain being tested. Each subprocess
must be characterized by a small number of
primary traits or characteristics that cover the knowledge
and skills necessary to do well in that aspect of
the activity. The traits should cover both process and
products and should include planning and reflection.
For example, in writing, processes might include note
taking, outlining, drafting, and revising. The primary
traits for expository writing might be clarity, persuasiveness,
memorability, and enticingness (Collins &
Gentner, 1980). (The specific traits may differ for different
processes and products.) The primary traits
chosen should be ones that the test takers should
strive to achieve, and thus should be traits that are
learnable. The small number is necessary to focus the
test taker's learning. The particular traits chosen for
any task domain are not too critical, as long as they
cover the skills that are judged to be important and
they are learnable. In other words, we believe that the
testing approach is robust over different sets of primary
traits.

A library of exemplars. In order to ensure reliability
of scoring and learnability, it is important that for
each task there be a library of exemplars of all levels of
performance for each primary trait assessed in the
test. The library should include exemplars representing
the different ways to do well (or poorly) with respect
to each trait. It should also include critiques of
each sample performance, so that it is clear how the
performance was judged. The library should be accessible
to all, and particularly to the testees, so that they
can learn to assess their own performance reliably and
thus develop clear goals to strive for in their learning.

A training system for scoring tests. There are
three groups that must learn to score test performance
reliably: (a) the administrators of the testing system,
who develop and maintain the assessment standards
(i.e., master assessors); (b) the coaches in the testing
system whose role is to help test takers to perform
better; and (c) the test takers themselves, who must
internalize the criteria by which their work is being
judged. The master assessors are charged with defining
the criteria, ensuring that test performance can be
scored reliably, and training coaches to score performances.
The coaches work with the test takers to
teach them self-assessment.
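
One way to picture how the components described above fit together
is as a simple data structure. The Python sketch below is a
hypothetical illustration, not part of the proposed system. The
class and method names (Exemplar, Task, levels_covered,
missing_levels) are assumptions introduced here, the 4-point scale
is borrowed from the NAEP example, and the critique texts are
invented.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Set


@dataclass
class Exemplar:
    """One sample performance, scored on one primary trait, with a critique."""
    trait: str
    level: int      # rating on the assumed 4-point scale (1 lowest, 4 highest)
    critique: str   # rationale explaining why this score was assigned


@dataclass
class Task:
    """A task together with its primary traits and library of exemplars."""
    description: str
    primary_traits: List[str]
    exemplars: List[Exemplar] = field(default_factory=list)

    def levels_covered(self, trait: str) -> Set[int]:
        """Return the score levels for which the library holds an exemplar."""
        return {e.level for e in self.exemplars if e.trait == trait}

    def missing_levels(self, trait: str,
                       scale: Sequence[int] = (1, 2, 3, 4)) -> Set[int]:
        """Return the levels that still lack an exemplar for the given trait."""
        return set(scale) - self.levels_covered(trait)


# Hypothetical library entries for the school-board letter task.
task = Task(
    description="Letter to the school board chairman on a 12-month school year",
    primary_traits=["persuasiveness"],
)
task.exemplars.append(
    Exemplar("persuasiveness", 1, "Contains no reasonable argument."))
task.exemplars.append(
    Exemplar("persuasiveness", 4, "Several logical reasons with compelling details."))
missing = task.missing_levels("persuasiveness")
```

Here missing_levels reflects the requirement that the library hold
exemplars of all levels of performance for each trait; in this toy
library, levels 2 and 3 are still uncovered.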

Standards 

Standards must be developed for the testing system 
that include the following: 

Directness. From a systems point of view, we
have seen that it is essential that whatever knowledge
and skills we want test takers to develop be measured
directly. Sometimes this may require measuring a
process, sometimes a product, and sometimes both. In
either case, any indirectness in the measure will lead
to a misdirection of learning effort by test takers to the
degree that it matters to them to do well on the test.

Scope. The test should cover, as far as possible, all 
the knowledge, skills, and strategies required to do 
well in the activity. To the degree that any knowledge
or skills are left out, test takers will direct their learn- 
ing efforts to only part of what is required of them. 

Reliability. We think that the most effective way 
to obtain reliable scoring that fosters learning is to use 
primary trait scoring borrowed from the evaluation of
writing. Developing a primary trait system for any 
test involves the same steps that were used by NAEP 
in applying it to writing. 

Transparency. The terms in which the test takers
are judged must be clear to them if a test is to be
successful in motivating and directing learning
(Wiggins, 1989). In fact, we argue that the test must be
transparent enough so that they can assess themselves
and others with almost the same reliability as the
actual test evaluators achieve.

Methods for Fostering 
Improvement on the Test 

The testing system should not only employ forms of 
assessment that enhance learning, but it should also
include specific methods designed to foster such learning.
These include the following.

Practice in self-assessment. The test takers should
have ample opportunity to practice taking the test and
should have coaching to help them assess how well
they have done and why. This kind of reflection on
performance (Collins & Brown, 1988) is made possible
by recording technologies such as videotape and
computers. The assistance of a coach, who has inter- 
nalized the testing standards, is critical to helping the 
test takers see their performance through others' eyes. 

Repeated testing. Although it may be necessary to 
have the test administered at only a few times during 
a year, it is still important to encourage students to 
take the test multiple times to encourage striving for 
improvement. If what is measured by the test is important
to learn, then the test should not be taken once
and forgotten. It should serve as a beacon to guide 
future learning. 

Feedback on test performance. Whenever a person
takes the test, there should be a "rehash" with a
master assessor or teacher. This rehash should em- 
phasize what the testee did well and poorly on, and 
how performance might be improved. It should pref- 
erably involve a master assessor so that the institu- 
tionalized standards will be clear to the test taker. 

Multiple levels of success. There should be vari- 
ous landmarks of success in performance on the test, 
so that students can strive for higher levels of per- 
formance in repeated testing. The landmarks or levels 
might include such labels as "beginner," "intermediate,"
and "expert" to motivate attempts to do better.

Student Assessment 

The system we envision involves developing a number
of extended tasks or projects that students would
carry out to demonstrate their mastery of courses they
are taking, such as history or physics. We can illustrate
the approach with two structured tasks that might be
given to students in American history and physics.
For history, a task might be as follows: "At the beginning
of World War II, the United States was divided as
to whether to enter the war or to stay neutral. Pick
three presidents in history, other than Franklin Roosevelt,
who you think would have taken different positions
on the issue, and write a 2-minute speech of each
to the American public on what should be done in that
situation." These speeches might then be delivered
and recorded on videotape, with questions following
from other students as in a press conference. For
physics, the task might be to design a set of activities
using a Dynaturtle (diSessa, 1982; White, 1984) that
would help younger students learn to understand
Newton's Laws of Motion. (A Dynaturtle is an object
in a computer simulation that operates in a frictionless,
gravity-free environment, and is controlled like a
spaceship.) These are examples of the kind of extended
tasks that students could be given to demonstrate
their understanding of history or science. A
variety of such tasks could be provided to teachers for
use in assessment, or teachers could construct their
own tasks following a set of task specifications that are
provided to them. In general, the tasks to be included
within an assessment system would vary from structured
tasks that measure students' understanding of
critical concepts or skills to open-ended tasks that
allow students to demonstrate special knowledge and
creativity. Ideally, these tasks would be fully integrated
within a course, rather than serving as accessories
to the course.

Scoring Student Performance 

Students would be evaluated on the tasks in terms of
a set of primary traits. Examples of primary traits that
could be used are (a) clarity of expression, (b)
creativity, (c) depth of understanding or thoroughness,
(d) consideration of multiple perspectives, and (e)
focus or coherence. The particular traits chosen are,
again, not critical so long as they cover the desired
qualities and direct students' efforts appropriately.
The primary traits would cover both process and
products, and also might be applied to different phases
of an assessment task, such as planning, presentation,
and revision.

To implement the assessment system, it is important
to build a library of exemplars of students working
on a variety of tasks, covering all the major subject
areas. This library would be embodied in paper,
videotapes, and computer trails. For example, paper
records might include notes, outlines, and multiple
drafts of articles written. Videotapes might record
students discussing their initial plans, making
presentations, answering questions, or performing
dramatic scenes. Computers might record document
preparation and revision or students' solutions to
problems such as the physics activity described above.
Each of these exemplars should also contain a critique
of the performance by master assessors in terms of the
set of primary traits chosen for evaluating students.

The administration for such a system could be
centered at the school, district, state, or even
national level. There would have to be a group of master
assessors who are responsible for developing the set
of traits, the criteria for scoring, and the library of
exemplars. They would also be responsible for showing
teachers how to evaluate student performance, and in
fact testing teachers to make sure that they have
internalized the evaluation criteria. Teachers would
function as coaches to the students as they practiced
different tasks, to help them internalize the criteria
by which they are judged. Ideally, students would learn
how to critique their own and each other's performances
in terms of the primary traits adopted.

Addressing Different Audiences 

A major problem in student assessment is that the test
scores generated have to address the needs and desires
of many different audiences. Colleges need to know
whether the student meets their admission standards.
Teachers want to know what students have learned and
failed to learn. Parents and students want to know how
the student is doing relative to some standard.
Administrators want to know how well different teachers
and schools are succeeding. All of these different needs
have to be balanced in setting up an assessment system.

Because colleges are a major constituency for
student assessment, the criteria for evaluating students
in each subject should be developed in conjunction
with college admissions officers, who have ideas
about what are essential knowledge and skills for
admission. (For students in vocational courses,
criteria should be developed in consultation with
businesses and other potential employers and with
licensing boards.) These same criteria should suffice
for parents, students, and teachers, since they are the
outcome measures that are valued by colleges or
future employers, and are therefore ecologically valid
measures of performance that are judged to be important
in "real world" tasks.

A Changing Role for Testing Organizations 

Lest the proposal for a systemically valid testing
system we have made seem overly visionary, we shall
examine briefly the practical side of implementing
such a system. We believe that the efficiency in current
testing practices is greatly outweighed by the cost of
using a system that has low systemic validity, one
that has a negative impact on learning and teaching.
The goal of assessment has to be, above all, to support
the improvement of learning and teaching. To accomplish
this, major changes must occur in the role and
function of testing organizations. In the future, they
will retain their important role as developers of
assessment tools, and they will, as now, be responsible
for setting scoring standards and practices. However,
they will have to assume some new responsibilities:
(a) they must develop materials for use in teaching the
assessment techniques, not only to master assessors
within schools and school districts, but also to
teachers and students; and (b) they must take
responsibility for ensuring that the assessment
standards are assimilated and maintained by these new
groups of assessors. The big difference is that the
practice of assessment will no longer be confined to the
testing organizations; it will become more
decentralized, as teachers and students are taught to
internalize the standards of performance for which they
are to strive.

We end with some caveats. Clearly, much research
needs to be done to test the assumptions on
which our proposal is based: Can primary traits be
assessed reliably on a common scale when the particular
tasks that test takers carry out may vary? Does
an awareness of primary traits help students to improve
performance on projects and teachers to become more
effective in the classroom? Can a consensus be reached
on what are appropriate primary traits for different
domains and activities? Can scoring standards be met
when assessment is decentralized? These and other
questions should become the basis of a concerted
research effort in support of a new, systemically valid
system of educational testing.




Notes 

This work was supported by the Center for Technology
in Education under Grant No. 1-135562167-A1 from the
Office of Educational Research and Improvement, U.S.
Department of Education, to Bank Street College of
Education. We would like to thank Norman Frederiksen,
Drew Gitomer, Robert Glaser, and Ray Nickerson for their
thoughtful comments on an earlier draft of the paper.

1. A critical assumption is that scorers can learn to
recognize and reliably assess primary traits, not only in
the particular tasks used in the library of exemplars, but
in other tasks for which the trait is relevant. Although
there is evidence bearing on this assumption in the
assessment of writing (Breland & Jones, 1988), further
work will be required to check its validity for the
specific primary traits that are to be the goal of
assessment.